R² Score (r2_score)#

The coefficient of determination (R^2) is a regression metric that answers:

“How much better are my predictions than the simplest baseline that always predicts the mean?”

It’s scale-free and easy to interpret on a single dataset, but it has some gotchas: it can be negative, it is undefined for constant targets, and the “variance explained” interpretation only holds under specific conditions.


Learning goals#

By the end you should be able to:

  • derive and interpret (R^2)

  • understand why (R^2\in(-\infty, 1]) and what (R^2=0) means

  • implement r2_score from scratch in NumPy (including weights + multioutput)

  • connect maximizing (R^2) to minimizing squared error

  • track (R^2) while fitting linear regression with gradient descent

Quick import#

from sklearn.metrics import r2_score
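A minimal sanity check with made-up numbers you can verify by hand (here (\text{SSE}=1.5) and (\text{SST}=29.1875)):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

score = r2_score(y_true, y_pred)
print(score)  # ≈ 0.9486  (1 - 1.5 / 29.1875)
```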

Prerequisites#

  • mean and variance

  • squared error

  • basic regression notation

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import r2_score as sk_r2_score

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Definition: compare to the mean baseline#

Given true targets (y_1,\dots,y_n) and predictions (\hat{y}_1,\dots,\hat{y}_n):

  • residuals: (e_i = y_i - \hat{y}_i)

  • baseline prediction: the mean (\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i)

Define the sum of squared errors and the total sum of squares:

[ \text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2,\qquad \text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2 ]

Then the (R^2) score is:

[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}} ]

Interpretation:

  • (R^2 = 1): perfect predictions ((\text{SSE}=0))

  • (R^2 = 0): no improvement over predicting the mean ((\text{SSE}=\text{SST}))

  • (R^2 < 0): worse than predicting the mean ((\text{SSE}>\text{SST}))

So (R^2) is a relative-to-baseline score, not an “absolute error” measure.

# A small synthetic example: four predictors on the same y_true
n = 60
x = np.linspace(0, 10, n)
y_true = 1.5 + 0.8 * x + rng.normal(0, 1.2, size=n)

mean_pred = np.full_like(y_true, y_true.mean())

models = {
    "Perfect": y_true,
    "Good (small noise)": y_true + rng.normal(0, 0.5, size=n),
    "Mean baseline": mean_pred,
    "Worse-than-mean (anti)": 2 * y_true.mean() - y_true,  # guarantees R² = -3
}

sst = np.sum((y_true - y_true.mean()) ** 2)

rows = []
for name, y_pred in models.items():
    sse = np.sum((y_true - y_pred) ** 2)
    rows.append((name, sse, 1 - sse / sst, sk_r2_score(y_true, y_pred)))

print("SST (baseline SSE) =", sst)
print("\nModel comparison")
print("-" * 70)
for name, sse, r2_manual, r2_sklearn in rows:
    print(f"{name:24s} SSE={sse:8.2f}  R²(manual)={r2_manual:7.3f}  R²(sklearn)={r2_sklearn:7.3f}")

# Bar view: R² is a rescaling of SSE (on a fixed dataset)
names = [r[0] for r in rows]
sse_vals = np.array([r[1] for r in rows], dtype=float)
r2_vals = np.array([r[3] for r in rows], dtype=float)
sse_over_sst = sse_vals / sst

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Relative error: SSE / SST", "R² = 1 - SSE/SST"),
)

fig.add_trace(go.Bar(x=names, y=sse_over_sst, name="SSE/SST", marker_color="#E45756"), row=1, col=1)
fig.add_hline(
    y=1.0,
    line_dash="dash",
    line_color="black",
    row=1,
    col=1,
    annotation_text="mean baseline",
    annotation_position="top left",
)

fig.add_trace(go.Bar(x=names, y=r2_vals, name="R²", marker_color="#4C78A8"), row=1, col=2)
fig.add_hline(
    y=0.0,
    line_dash="dash",
    line_color="black",
    row=1,
    col=2,
    annotation_text="mean baseline",
    annotation_position="top left",
)

fig.update_yaxes(title_text="SSE / SST", row=1, col=1)
fig.update_yaxes(title_text="R²", row=1, col=2)
fig.update_xaxes(tickangle=-20, row=1, col=1)
fig.update_xaxes(tickangle=-20, row=1, col=2)
fig.update_layout(title="Same ordering, different scale", height=420, showlegend=False)
fig.show()
SST (baseline SSE) = 403.6001913242214

Model comparison
----------------------------------------------------------------------
Perfect                  SSE=    0.00  R²(manual)=  1.000  R²(sklearn)=  1.000
Good (small noise)       SSE=    9.27  R²(manual)=  0.977  R²(sklearn)=  0.977
Mean baseline            SSE=  403.60  R²(manual)=  0.000  R²(sklearn)=  0.000
Worse-than-mean (anti)   SSE= 1614.40  R²(manual)= -3.000  R²(sklearn)= -3.000
# Visual intuition: y_true vs y_pred (each panel shows a different model)
subplot_titles = [
    f"{name}<br>R² = {sk_r2_score(y_true, y_pred):.3f}" for name, y_pred in models.items()
]

fig = make_subplots(rows=2, cols=2, subplot_titles=subplot_titles)

mn = float(np.min(y_true))
mx = float(np.max(y_true))

for k, (name, y_pred) in enumerate(models.items()):
    row = 1 if k < 2 else 2
    col = 1 if (k % 2) == 0 else 2

    fig.add_trace(
        go.Scatter(x=y_true, y=y_pred, mode="markers", name=name, showlegend=False),
        row=row,
        col=col,
    )
    fig.add_trace(
        go.Scatter(
            x=[mn, mx],
            y=[mn, mx],
            mode="lines",
            line=dict(color="black", dash="dash"),
            showlegend=False,
        ),
        row=row,
        col=col,
    )

    fig.update_xaxes(title_text="y_true", row=row, col=col)
    fig.update_yaxes(title_text="y_pred", row=row, col=col)

fig.update_layout(
    title="R² compares SSE to the mean baseline (R² can be negative)",
    height=650,
)
fig.show()

Why the mean baseline makes sense#

The baseline prediction (\hat{y}_i = \bar{y}) minimizes squared error among all constant predictors.

  • Any constant (c) gives (\sum_i (y_i - c)^2).

  • The derivative w.r.t. (c) is (-2\sum_i(y_i-c)), which is 0 at (c=\bar{y}).

So (\text{SST}) is literally “the best SSE you can do without using features”.

That’s why (R^2=0) means “no better than doing nothing smarter than predicting the mean”.

# Visualize why the mean is the best constant predictor
c_grid = np.linspace(float(y_true.min()) - 2.0, float(y_true.max()) + 2.0, 200)
sse_grid = np.array([np.sum((y_true - c) ** 2) for c in c_grid])

fig = go.Figure()
fig.add_trace(go.Scatter(x=c_grid, y=sse_grid, mode="lines", name="SSE(c)"))
fig.add_vline(
    x=float(y_true.mean()),
    line_dash="dash",
    line_color="black",
    annotation_text="mean(y)",
    annotation_position="top",
)
fig.update_layout(
    title="Constant predictor: SSE is minimized at the mean",
    xaxis_title="constant prediction c",
    yaxis_title="SSE",
    height=350,
)
fig.show()

2) Sums of squares and “variance explained”#

A common phrase is that (R^2) is the “fraction of variance explained”. That’s true in a specific (important) setting:

  • you fit ordinary least squares (OLS) with an intercept.

In that case, define the explained sum of squares:

[ \text{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 ]

Then we get the classic decomposition:

[ \text{SST} = \text{SSR} + \text{SSE} ]

and so

[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{SSR}}{\text{SST}}. ]

For arbitrary predictions (not OLS fitted values), the decomposition (\text{SST}=\text{SSR}+\text{SSE}) need not hold, but (R^2 = 1-\text{SSE}/\text{SST}) is still the standard definition used by sklearn.metrics.r2_score.
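To see that caveat numerically, here is a sketch with arbitrary (deliberately biased) predictions rather than OLS fitted values: the cross term in the decomposition no longer vanishes, so (\text{SST} \neq \text{SSR} + \text{SSE}), yet (R^2 = 1 - \text{SSE}/\text{SST}) remains well defined.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=50)
y_hat = 0.5 * y + 1.0  # arbitrary biased predictions, NOT OLS fitted values

y_bar = y.mean()
SST = np.sum((y - y_bar) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SSR = np.sum((y_hat - y_bar) ** 2)

print(f"SST = {SST:.3f},  SSR + SSE = {SSR + SSE:.3f}")  # generally different
print(f"R² = {1 - SSE / SST:.3f}")                       # still well defined
```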

A useful rewrite (same dataset)#

If you define

  • (\text{MSE} = \frac{1}{n}\sum_i (y_i-\hat{y}_i)^2)

  • (\mathrm{Var}(y) = \frac{1}{n}\sum_i (y_i-\bar{y})^2) (population variance, with (n) not (n-1))

then

[ R^2 = 1 - \frac{\text{MSE}}{\mathrm{Var}(y)}. ]

This highlights something important: on a fixed dataset, maximizing (R^2) is equivalent to minimizing MSE/SSE.

# Show SST = SSR + SSE for an OLS fit with an intercept
n = 120
x = rng.uniform(-3, 3, size=n)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=n)

X = np.column_stack([np.ones(n), x])  # intercept + feature
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w_ols

y_bar = y.mean()
SST = np.sum((y - y_bar) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SSR = np.sum((y_hat - y_bar) ** 2)

print("w_ols =", w_ols)
print(f"SST={SST:.3f}  SSR={SSR:.3f}  SSE={SSE:.3f}  (SST-(SSR+SSE))={SST-(SSR+SSE):.3e}")
print(f"R² (manual) = {1 - SSE / SST:.4f}")
print(f"R² (sklearn) = {sk_r2_score(y, y_hat):.4f}")
w_ols = [2.0415 1.5628]
SST=1093.626  SSR=964.266  SSE=129.360  (SST-(SSR+SSE))=4.547e-13
R² (manual) = 0.8817
R² (sklearn) = 0.8817
# Variance decomposition as a stacked bar
fig = go.Figure()
fig.add_trace(go.Bar(name="Explained (SSR)", x=["SST"], y=[SSR], marker_color="#4C78A8"))
fig.add_trace(go.Bar(name="Residual (SSE)", x=["SST"], y=[SSE], marker_color="#E45756"))

fig.update_layout(
    barmode="stack",
    title=f"OLS with intercept: SST = SSR + SSE  (R² = {1 - SSE / SST:.3f})",
    yaxis_title="Sum of squares",
    height=350,
)
fig.show()

3) From-scratch NumPy implementation#

Weighted + multioutput definitions#

For sample weights (w_i \ge 0), define the weighted mean:

[ \bar{y}_w = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i} ]

Then (per output):

[ \text{SSE}_w = \sum_{i=1}^n w_i (y_i - \hat{y}_i)^2,\qquad \text{SST}_w = \sum_{i=1}^n w_i (y_i - \bar{y}_w)^2,\qquad R^2_w = 1 - \frac{\text{SSE}_w}{\text{SST}_w}. ]

For multioutput (y \in \mathbb{R}^{n\times m}), compute (R^2) per output and then aggregate.

Below is a small NumPy implementation that mirrors the core ideas in scikit-learn:

  • supports 1D targets ((n,)) and multioutput targets ((n, m))

  • supports sample_weight

  • supports multioutput aggregation:

    • 'raw_values' (per-output scores)

    • 'uniform_average' (simple mean)

    • 'variance_weighted' (weights each output by its (\text{SST}))

  • handles constant targets with force_finite=True (sklearn’s default):

    • perfect predictions (\Rightarrow 1.0)

    • imperfect predictions (\Rightarrow 0.0)

Note: with fewer than 2 samples, (R^2) is undefined; scikit-learn returns nan.

def r2_score_numpy(
    y_true,
    y_pred,
    *,
    sample_weight=None,
    multioutput="uniform_average",
    force_finite=True,
):
    """NumPy implementation of the R² score.

    Parameters
    ----------
    y_true, y_pred : array-like
        Shape (n_samples,) or (n_samples, n_outputs)

    sample_weight : array-like, optional
        Shape (n_samples,)

    multioutput : {'raw_values', 'uniform_average', 'variance_weighted'} or array-like

    force_finite : bool
        If True (default), replace non-finite scores for constant targets:
        - perfect predictions -> 1.0
        - imperfect predictions -> 0.0

    Returns
    -------
    score : float or ndarray
    """

    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"y_true and y_pred must have the same shape, got {y_true.shape} vs {y_pred.shape}")

    if y_true.ndim == 1:
        y_true_2d = y_true.reshape(-1, 1)
        y_pred_2d = y_pred.reshape(-1, 1)
    elif y_true.ndim == 2:
        y_true_2d = y_true
        y_pred_2d = y_pred
    else:
        raise ValueError(f"Expected 1D or 2D arrays, got ndim={y_true.ndim}")

    n_samples, n_outputs = y_true_2d.shape

    if n_samples < 2:
        # R² is undefined with fewer than 2 samples (sklearn returns nan + a warning)
        return np.nan

    if sample_weight is None:
        w = np.ones((n_samples, 1), dtype=float)
    else:
        w = np.asarray(sample_weight, dtype=float).reshape(-1, 1)
        if w.shape[0] != n_samples:
            raise ValueError(f"sample_weight must have shape (n_samples,), got {w.shape}")
        if np.any(w < 0):
            raise ValueError("sample_weight must be non-negative")

    w_sum = np.sum(w)
    if w_sum == 0:
        raise ValueError("Sum of sample_weight must be > 0")

    # Weighted mean per output
    y_bar = np.sum(w * y_true_2d, axis=0) / w_sum

    numerator = np.sum(w * (y_true_2d - y_pred_2d) ** 2, axis=0)  # SSE per output
    denominator = np.sum(w * (y_true_2d - y_bar) ** 2, axis=0)     # SST per output

    with np.errstate(divide="ignore", invalid="ignore"):
        output_scores = 1.0 - (numerator / denominator)

    if force_finite:
        # Constant targets -> denominator == 0
        mask = denominator == 0
        if np.any(mask):
            # Perfect prediction => numerator == 0 -> score 1.0
            perfect = mask & (numerator == 0)
            output_scores = output_scores.copy()
            output_scores[perfect] = 1.0
            output_scores[mask & ~perfect] = 0.0

    # Multioutput aggregation
    if multioutput == "raw_values":
        scores = output_scores
    elif multioutput == "uniform_average":
        scores = float(np.mean(output_scores))
    elif multioutput == "variance_weighted":
        # Weight each output's score by its SST (denominator)
        denom_sum = float(np.sum(denominator))
        if denom_sum > 0:
            scores = float(np.sum(output_scores * denominator) / denom_sum)
        else:
            scores = float(np.mean(output_scores))
    else:
        # array-like weights per output
        weights = np.asarray(multioutput, dtype=float)
        if weights.shape != (n_outputs,):
            raise ValueError(f"multioutput weights must have shape (n_outputs,), got {weights.shape}")
        if np.any(weights < 0):
            raise ValueError("multioutput weights must be non-negative")
        w_out_sum = float(np.sum(weights))
        if w_out_sum == 0:
            raise ValueError("Sum of multioutput weights must be > 0")
        scores = float(np.sum(output_scores * weights) / w_out_sum)

    if n_outputs == 1 and multioutput != "raw_values":
        return float(scores)

    return scores
# Quick checks vs scikit-learn

# 1D
y_true = rng.normal(size=50)
y_pred = y_true + rng.normal(0, 0.3, size=50)

print("1D")
print("numpy :", r2_score_numpy(y_true, y_pred))
print("sklearn:", sk_r2_score(y_true, y_pred))

# Multioutput
Y_true = rng.normal(size=(80, 3))
Y_pred = Y_true + rng.normal(0, 0.5, size=(80, 3))
w = rng.uniform(0.5, 2.0, size=80)

print("\nMultioutput (raw)")
print("numpy :", r2_score_numpy(Y_true, Y_pred, sample_weight=w, multioutput="raw_values"))
print("sklearn:", sk_r2_score(Y_true, Y_pred, sample_weight=w, multioutput="raw_values"))

print("\nMultioutput (uniform_average)")
print("numpy :", r2_score_numpy(Y_true, Y_pred, sample_weight=w, multioutput="uniform_average"))
print("sklearn:", sk_r2_score(Y_true, Y_pred, sample_weight=w, multioutput="uniform_average"))

print("\nMultioutput (variance_weighted)")
print("numpy :", r2_score_numpy(Y_true, Y_pred, sample_weight=w, multioutput="variance_weighted"))
print("sklearn:", sk_r2_score(Y_true, Y_pred, sample_weight=w, multioutput="variance_weighted"))

# Constant target edge case
print("\nConstant y_true")
y_const = np.ones(5)
print("perfect (force_finite=True):", r2_score_numpy(y_const, np.ones(5), force_finite=True))
print("bad     (force_finite=True):", r2_score_numpy(y_const, np.zeros(5), force_finite=True))
print("perfect (force_finite=False):", r2_score_numpy(y_const, np.ones(5), force_finite=False))
print("bad     (force_finite=False):", r2_score_numpy(y_const, np.zeros(5), force_finite=False))
1D
numpy : 0.8907791234783493
sklearn: 0.8907791234783493

Multioutput (raw)
numpy : [0.7509 0.7757 0.7692]
sklearn: [0.7509 0.7757 0.7692]

Multioutput (uniform_average)
numpy : 0.7652668866547568
sklearn: 0.7652668866547568

Multioutput (variance_weighted)
numpy : 0.764809940184828
sklearn: 0.764809940184828

Constant y_true
perfect (force_finite=True): 1.0
bad     (force_finite=True): 0.0
perfect (force_finite=False): nan
bad     (force_finite=False): -inf

4) Using R² while optimizing a model (from scratch)#

Because (\text{SST}) depends only on (y), it is constant w.r.t. model parameters (\theta) on a fixed dataset.

[ R^2(\theta) = 1 - \frac{\text{SSE}(\theta)}{\text{SST}} ]

So:

  • maximizing (R^2) (\Leftrightarrow) minimizing (\text{SSE}) (or MSE)

  • their gradients differ only by a constant scale:

[ \nabla_\theta R^2(\theta) = -\frac{1}{\text{SST}}\nabla_\theta \text{SSE}(\theta) ]

In practice we optimize MSE/SSE (a proper loss), and use (R^2) as a training/validation score.

Example: linear regression with gradient descent#

We’ll fit a line (\hat{y} = w_0 + w_1 x) by minimizing MSE, and track (R^2) over iterations.

# Data: y = w0 + w1 x + noise
n = 200
x = rng.uniform(-2, 4, size=n)
y = 1.0 + 2.5 * x + rng.normal(0, 1.0, size=n)

X = np.column_stack([np.ones(n), x])

# Closed-form (for reference)
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent
w = np.zeros(2)
lr = 0.05
n_steps = 300

r2_hist = []
mse_hist = []
w_hist = []

for _ in range(n_steps):
    y_hat = X @ w
    err = y_hat - y

    mse = float(np.mean(err**2))
    grad = (2.0 / n) * (X.T @ err)

    r2_hist.append(r2_score_numpy(y, y_hat))
    mse_hist.append(mse)
    w_hist.append(w.copy())

    w = w - lr * grad

w_hist = np.asarray(w_hist)

print("Closed-form w:", w_closed)
print("GD final w   :", w)
print("Final R²     :", r2_hist[-1])
Closed-form w: [0.9932 2.4779]
GD final w   : [0.9932 2.4779]
Final R²     : 0.9430322406480296
# Optimization diagnostics
iters = np.arange(n_steps)

fig = make_subplots(rows=1, cols=2, subplot_titles=("R² vs iteration", "MSE vs iteration"))

fig.add_trace(go.Scatter(x=iters, y=r2_hist, mode="lines", name="R²"), row=1, col=1)
fig.update_yaxes(title_text="R²", row=1, col=1)
fig.update_xaxes(title_text="iteration", row=1, col=1)

fig.add_trace(go.Scatter(x=iters, y=mse_hist, mode="lines", name="MSE", line=dict(color="#E45756")), row=1, col=2)
fig.update_yaxes(title_text="MSE", row=1, col=2)
fig.update_xaxes(title_text="iteration", row=1, col=2)

fig.update_layout(height=350, title="Gradient descent improves MSE and (equivalently) R²")
fig.show()
# Fit visualization: data + mean baseline + fitted line
x_line = np.linspace(x.min(), x.max(), 200)
X_line = np.column_stack([np.ones_like(x_line), x_line])

y_line_gd = X_line @ w
y_line_closed = X_line @ w_closed

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode="markers", name="data", marker=dict(size=6, opacity=0.6)))
fig.add_trace(go.Scatter(x=x_line, y=np.full_like(x_line, y.mean()), mode="lines", name="mean baseline", line=dict(dash="dash", color="gray")))
fig.add_trace(go.Scatter(x=x_line, y=y_line_closed, mode="lines", name="closed-form", line=dict(color="#4C78A8", width=3)))
fig.add_trace(go.Scatter(x=x_line, y=y_line_gd, mode="lines", name="gradient descent", line=dict(color="#F58518", width=3)))

fig.update_layout(
    title=f"Linear regression fit (R² ≈ {r2_score_numpy(y, X @ w):.3f})",
    xaxis_title="x",
    yaxis_title="y",
    height=450,
)
fig.show()

5) Pros, cons, and when to use#

Pros#

  • Baseline-relative: interpretable against a meaningful reference (predicting (\bar{y}))

  • Scale-free: shifting/scaling (y) (and predictions accordingly) does not change (R^2)

  • Widely used: default .score() for many sklearn regressors
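The scale-free claim is easy to verify directly: applying the same affine map (a y + b) to both targets and predictions multiplies SSE and SST by (a^2), so (R^2) is unchanged. A small check (the constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)
y_hat = y + rng.normal(0, 0.4, size=100)

def r2(y_true, y_pred):
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - sse / sst

a, b = 3.7, -12.0  # arbitrary affine rescaling
print(r2(y, y_hat))
print(r2(a * y + b, a * y_hat + b))  # identical (up to float rounding)
```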

Cons / pitfalls#

  • Can be negative: a model can be arbitrarily worse than the mean baseline

  • Not defined for constant targets: if (\text{SST}=0), the score is (0/0) or (-\infty); sklearn’s default force_finite=True maps these (nan, -inf) results to (1.0, 0.0)

  • “Variance explained” is conditional: the (\text{SST}=\text{SSR}+\text{SSE}) story holds for OLS with an intercept, not arbitrary predictions

  • Can reward overfitting: on training data, (R^2) almost always increases as you add features (even useless ones)

  • Outlier-sensitive: uses squared error, so large residuals dominate

When it’s a good choice#

  • comparing regression models on the same target and same dataset split

  • reporting an intuitive “better than mean?” summary alongside an absolute metric (MAE/RMSE)

  • model selection with cross-validated (R^2) rather than training (R^2)

When to be careful#

  • comparing across datasets with very different target variance

  • heavy-tailed noise / many outliers (consider MAE, Huber loss, or robust metrics)

  • non-stationary time series (always evaluate out-of-sample; consider rolling/forward validation)

6) Exercises#

  1. Show that predicting (\bar{y}) is the best constant predictor by plotting SSE as a function of constant (c).

  2. Create a case where (R^2) is strongly negative and explain it in terms of SSE vs SST.

  3. Implement adjusted (R^2):

[ \bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-d-1} ]

and compare it to (R^2) while adding random (useless) features.

  4. Compute cross-validated (R^2) for a linear model vs a constant baseline.

References#

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

  • Wikipedia: https://en.wikipedia.org/wiki/Coefficient_of_determination